Safety Requires No Single Points of Failure

If someone is likely to die due to a system failure, that means the
system can't have any single points of failure. The implications
of this are more subtle than they might seem at first glance,
because a "single point" isn't a single line of code or a single logic
gate, but rather a "fault containment region," which is often an
entire chip.

Safety Critical Systems Can't Have Single Points of Failure


One of the basic tenets of safety critical system design is avoiding 
single points of failure. A single point of failure is a component 
that, if it is the only thing that fails, can make the system unsafe. 

In contrast, a double (or triple, etc.) point of failure situation is 
one in which two (or more) components must fail to result in an 
unsafe system. 

As an example, a single-engine airplane has a single point of 
failure (the single engine - if it fails, the plane doesn’t have 
propulsion power any more). A commercial airliner with two 
engines does not have an engine as a single point of failure, 
because if one engine fails the second engine is sufficient to 
operate and land the aircraft in expected operational scenarios. 
Similarly, cars have braking actuators on more than one wheel in 
part so that if one fails there are other redundant actuators to 
stop the vehicle. 

The objectives of the risk assessment are typically:

1. To show that no single point of failure within the system can
lead to a potentially unsafe state, in particular for the higher
Integrity Levels.

2. To show that the risk of multiple (sequential and simultaneous) […]
cause an unsafe state is within an acceptable level.

MISRA Report 2 states that avoiding such failures is the 
centerpiece of risk assessment. (MISRA Report 2 p. 17) 
For systems that are deployed in large numbers for many
operating hours, it can reasonably be expected that any complex 
component such as a computer chip or the software within it will 
fail at some point. Thus, attaining reasonable safety requires 
ensuring that no such component is a single point of failure. 

When redundancy is used to avoid having a single point of 
failure, it is important that the redundant components not have 
any “common mode” failures. In other words, there must not be 
any way in which both redundant components would fail from 
the same cause, because any such cause would in effect be a 
single point of failure. As an example, if two aircraft engines use 
the same fuel pump, then that fuel pump becomes a single point 
of failure if it stops pumping fuel to both engines at the same 
time. Or if two brake pads in a car are closed by pressure from 
the same hydraulic line, then a hydraulic line rupture could cause 
both brake pads to fail at the same time due to loss of hydraulic 
pressure. 
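
To make the common mode idea concrete, here is a minimal sketch in Python. It assumes a toy model in which an engine runs only if everything it depends on is working; the component names and the has_propulsion rule are hypothetical and not taken from any real aircraft. The sketch lists every component whose failure, by itself, removes all propulsion.

```python
# Hypothetical dependency model: each engine needs all of its listed components.
SHARED_PUMP = {
    "engine_1": {"engine_1_core", "fuel_pump"},
    "engine_2": {"engine_2_core", "fuel_pump"},
}
SEPARATE_PUMPS = {
    "engine_1": {"engine_1_core", "fuel_pump_1"},
    "engine_2": {"engine_2_core", "fuel_pump_2"},
}


def has_propulsion(design, failed):
    """The aircraft has propulsion if at least one engine has no failed dependency."""
    return any(not (deps & failed) for deps in design.values())


def single_points_of_failure(design):
    """Return the components whose failure alone removes all propulsion."""
    components = set().union(*design.values())
    return sorted(c for c in components if not has_propulsion(design, {c}))


print(single_points_of_failure(SHARED_PUMP))     # ['fuel_pump']
print(single_points_of_failure(SEPARATE_PUMPS))  # []
```

In this toy model the two engines look redundant, but the shared pump shows up as the one component that defeats the redundancy.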

Similarly, it is important that there be no sequences of failures 
that escape detection until a last failure results in a system 
failure. For example, if one engine on a two-engine aircraft fails, 
it is essential to be able to detect that failure immediately so that 
diversion to a nearby airport can be completed before the second 
engine has a chance to fail as well. 

For functions implemented in software, designers must assume 
that software can fail due to software defects missed in testing or 
due to hardware malfunctions. Software failures that must be 
considered include: a task dying or “hanging,” a task missing its 
deadline, a task producing an unsafe value due to a 
computational error, and a task producing an unsafe value due to 
unexpected (potentially abnormal) input values. It is important 
to note that when a software task “dies” that does not mean the 
rest of the system stops operating. For example, a software task 
that periodically computes a servo angle might die and leave the 
servo angle commanded to the last computed position, 
depending upon whether other tasks within the system notice 
whether the servo angle computation task is alive or not. 

Avoiding single point failures in a multitasking software system 
requires that a single task must not manage both normal 
behavior and failure mode backup behavior, because if a failure is 
caused by that task dying or misbehaving, the backup behavior is 
likely to be lost as well. Similarly, a single task should not both 
perform an action and also monitor faults in that action, because 
if the task gets the action wrong, one should reasonably expect 
that it will subsequently fail to perform the monitoring correctly 
as well. 
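
The separation described above can be sketched as follows. This is only an illustration under stated assumptions: the class names, the heartbeat counter, and the "safe_state" reaction are all hypothetical, and both objects run in one Python process purely for demonstration. In a real system the monitor would have to live in a separate fault containment region, as discussed later in this post.

```python
class ServoTask:
    """Control task: computes a servo angle and bumps a heartbeat counter."""

    def __init__(self):
        self.heartbeat = 0
        self.commanded_angle = 0.0
        self.alive = True

    def step(self, target):
        if not self.alive:
            return  # a dead task silently stops updating; the last angle stays commanded
        self.commanded_angle = target
        self.heartbeat += 1


class Monitor:
    """Separate monitor: it never computes the angle, it only checks liveness."""

    def __init__(self, max_missed=2):
        self.last_seen = 0
        self.missed = 0
        self.max_missed = max_missed

    def check(self, task):
        """Return True while the task looks alive, False once it is declared dead."""
        if task.heartbeat == self.last_seen:
            self.missed += 1
        else:
            self.missed = 0
            self.last_seen = task.heartbeat
        return self.missed < self.max_missed


def run(ticks, dies_at):
    """Simulate periodic execution; return (tick, outcome)."""
    task, monitor = ServoTask(), Monitor()
    for t in range(ticks):
        if t == dies_at:
            task.alive = False
        task.step(target=10.0 + t)
        if not monitor.check(task):
            return t, "safe_state"  # assumed reaction, e.g. remove actuator power
    return ticks, "normal"


print(run(10, dies_at=3))   # (4, 'safe_state')
print(run(5, dies_at=99))   # (5, 'normal')
```

If the liveness check were folded into ServoTask itself, the check would stop running at the same moment the task died, which is exactly the situation the paragraph above warns against.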

It is well known in the research literature that avoiding single 
points of failure is essential in a safety critical system. Leveson 
says: “To prove the safety of a complex system in the presence of 
faults, it is necessary to show that no single fault can cause a
hazardous effect and that hazards resulting from sequences of 
failures are sufficiently remote.” (Leveson 1986, pg. 139) 

Storey says, in the context of software safety: “it is important to 
note that as complete isolation can never be guaranteed, system 
designers should not rely on the correct operation of a single 
module to achieve safety, even if that module is well trusted. As 
in all aspects of safety, several independent means of assuring 
safety are preferred.” (Storey 1996, pg. 238) In other words, a 
fundamental tenet of safety critical system design is to first 
ensure that there are no single points of failure, and then ensure 
that there are no likely pairs or sequences of failures that can also 
cause a safety problem. A specific implication of this statement is 
that software within a single CPU cannot be trusted to operate 
properly at all times. A second independent CPU or other device 
must be used to duplicate or monitor computations to ensure 
safety. Storey gives a variety of redundant hardware patterns to 
avoid single point failures (id., pp. 131-143).

Douglass says: 
3.3.3 Single-Point Failures

Devices ought to be safe when there are no faults and the device i […]
properly. Most experts consider a device safe, however, only wht […]
single-point failure cannot lead to an incident. That is, the failure i […]
single component or the failure of multiple components due to ar […]
gle failure event should not result in an unsafe condition. 6

For example, consider total software control on a single CF […]
a patient ventilator. What happens if the CPU locks up? What i […]

5. Been there, done that.

6. Whether or not the system must consider multiple independent faults in it […]
analysis depends on the risk. If the faults are sufficiently likely and the damage […]
tial sufficiently high, then multiple-fault scenarios must be considered. Nuclear […]
plants fall into this category.

corrupts memory that contains the executable code or the comma […]
tidal volume and breathing rate? What if the ventilator loses pc […]
What if the gas supply fails? What if a valve sticks open or closed?

Given that an untoward event can happen to a component […]
must consider the effect of its failure on the safety of the system, […]
only means of controlling hazards in the software-controlled vent […]
is to raise an alarm on the ventilator itself, then the means of cc […]
may be inadequate. How can a stalled CPU also raise an alarm t […]
the user's attention to the hazard? This fault, a stalled CPU, affects […]
the primary action (ventilation) and the means of hazard cc […]
(alarming). This is a common mode failure - that is, a failure in mu […]
control paths due to a common or shared fault.

For safety analysis, you cannot consider the probability of fc […]
for the single fault. Regardless of how remote the chance of faili […]
safe system continues to be safe in the event of any single-point fa […]
This has broad implications. Consider a watchdog circuit in a ca […]

Douglass on single-point failures (Douglass 1999, pp. 105-107) 

Douglass makes it clear that all single point failures must be 
addressed to attain safety, no matter how remote the probability 
of failure might be. 

Moreover, random faults are just one type of fault that is
expected on a regular, if infrequent, basis. “Random faults, like
end-of-life failures of electrical components, cannot be designed
away. It is possible to add redundancy so that such faults can be
easily detected, but no one has ever made a CPU that cannot fail.”
(Douglass 1999, pg. 105, emphasis added)

Fault Containment Regions Count As "Single Points" of Failure


Additionally, it is important to recognize that any line of code on 
a CPU (or even many lines of code on a CPU) can be a single 
point fault that creates arbitrary effects in diverse places
elsewhere on that same CPU. Addy makes it clear that: “all parts 
of the software must be investigated for errors that could have a 
safety impact. This is not a part of the software safety analysis 
that can be slighted without risk, and may require more 
emphasis than it currently receives. [Addy’s] case study suggests 
that, in particular, care should be taken to search for errors in 
noncritical software that have a hidden interface with the safety- 
critical software through data and system services.” (Addy 1991, 
pg 83). Thus, it is inadequate to look at only some of the code to 
evaluate safety - you have to look at every last line of code to be 
sure. 

A Fault Containment Region (FCR) is a region of a system which 
is neither vulnerable to external faults, nor causes faults outside 
the FCR. (Lala 1994, p. 30) The importance of an FCR is that 
arbitrary faults that originate inside an FCR cannot spread to 
another FCR, providing the basis of fault containment within a 
system that has redundancy to tolerate faults. In other words, if 
an FCR can generate erroneous data, you need a separate, 
independent FCR to detect that fault as the basis for creating a 
fault tolerant system (id.).



Hammet presents architectures that are fail safe, such as a “self-checking
pair with simplex fault down” that in my experience is
also commonly used in rail signaling equipment. (Hammet 
2002, pp. 19-20). Rail signaling systems I have worked with also 
use redundancy to avoid single points of failure. So it's no 
mystery how to do this — you just have to want to get it right. 
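
As a rough illustration of the comparison at the heart of a self-checking pair, here is a minimal Python sketch. It models only the "two channels must agree or the output is withheld" idea; the "simplex fault down" behavior Hammet describes is not modeled, and the function names and the stand-in computations are hypothetical. In a real design each channel would run in its own FCR.

```python
def self_checking_pair(channel_a, channel_b, inputs, tolerance=0.0):
    """Run two redundant channels and release an output only if they agree.

    Returns (output, ok). On disagreement the pair fails silent: output is None.
    """
    a = channel_a(inputs)
    b = channel_b(inputs)
    if abs(a - b) <= tolerance:
        return a, True
    return None, False


def healthy(x):
    return 2.0 * x + 1.0


def faulty(x):
    # Stands in for a channel whose computation has been corrupted.
    return 2.0 * x + 1.5


print(self_checking_pair(healthy, healthy, 3.0))  # (7.0, True)
print(self_checking_pair(healthy, faulty, 3.0))   # (None, False)
```

The pair cannot tell which channel is wrong, only that they disagree, which is why the safe reaction is to stop producing outputs rather than to pick one.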

Obermaisser gives a list of failure modes for a “fault containment 
region” (FCR) that includes “arbitrary failures,” (Obermaisser 
2006 pp. 7-8, section 3.2) with the understanding that a safe 
system must have multiple redundant FCRs to avoid a single 
point of failure. Put another way, a fault anywhere in a single 
FCR renders the entire FCR unreliable. Obermaisser also says 
that “The independence of FCRs can be compromised by shared 
physical resources (e.g., power supply, timing source), external 
faults (e.g., Electromagnetic Interference (EMI), spatial
proximity) and design.” (id., section 3.1 p. 6) 

Kopetz says that an automotive drive-by-wire system design 
must “assure that replicated subsystems of the architecture for 
ultra-dependable systems fail independently of each other.” 
(Kopetz 2004, p. 32, emphasis per original) Kopetz also says that
FCRs must be completely independent of each other,
including: computer hardware, power supply, timing source, 
clock synchronization service, and physical space, noting that “all 
correlated failures of two subsystems residing on the same silicon 
die cannot be eliminated.” (id., p. 34). 

Continuing on with automotive specific references, none of this 
was news to the authors of the MISRA Software Guidelines, who 
said the objective of risk assessment is to “show that no single 
point of failure within the system can lead to a potentially unsafe 
state, in particular for the higher Integrity Levels.” (MISRA 
Report 2, 1995, p. 17). In this context, “higher Integrity Levels”
are those functions that could cause significant unsafe behavior. 
That report also says that the risk from multiple faults must be 
sufficiently low to be acceptable. 

Mauser reports on a Siemens Automotive study of electronic 
throttle control for automobiles (Mauser 1999). The study 
specifically accounted for random faults (id. p. 732), as well as 
considering the probability of “runaway” incidents (id., p. 734).
It found a possibility of single point failures, and in particular 
identified dual redundant throttle electrical signals being read by 
a single shared (multiplexed) analog to digital converter in the 
CPU (id., p. 739) as a critical flaw. 
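
To see why a shared converter undermines redundant signals, consider the following Python sketch. It is a deliberately simplified model, not a reconstruction of the Siemens study: the Adc class, the stuck-at fault, the voltages, and the tolerance are all assumptions chosen for illustration, and both sensors are assumed to use the same scaling.

```python
class Adc:
    """Toy ADC; `stuck_at` models a converter fault that ignores its input."""

    def __init__(self, stuck_at=None):
        self.stuck_at = stuck_at

    def read(self, volts):
        return self.stuck_at if self.stuck_at is not None else volts


def signals_agree(reading_1, reading_2, tolerance=0.1):
    """Plausibility check: the two redundant throttle readings must match."""
    return abs(reading_1 - reading_2) <= tolerance


def check_throttle(adc_for_1, adc_for_2, sensor_1_volts, sensor_2_volts):
    """Read both sensors and return (reading_1, reading_2, readings_agree)."""
    r1 = adc_for_1.read(sensor_1_volts)
    r2 = adc_for_2.read(sensor_2_volts)
    return r1, r2, signals_agree(r1, r2)


# Both sensors report a low (hypothetical) 0.5 V pedal position.
shared_faulty_adc = Adc(stuck_at=4.5)
print(check_throttle(shared_faulty_adc, shared_faulty_adc, 0.5, 0.5))
# (4.5, 4.5, True): the wrong value passes the cross-check

print(check_throttle(Adc(stuck_at=4.5), Adc(), 0.5, 0.5))
# (4.5, 0.5, False): with independent converters the fault is caught
```

With one multiplexed converter, a single fault corrupts both "redundant" readings in the same way, so the comparison that was supposed to catch the fault sees nothing wrong.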

Ademaj says that “independent fault containment regions must 
be implemented in separate silicon dies.” (Ademaj 2003, p. 5) In 
other words, any two functions on the same silicon die are 
subject to arbitrary faults and constitute a single point of failure. 

Because of the further 
miniaturization of future submicron integrated circuits, 
transient and intermittent faults will happen more 
frequently and transient and intermittent failures will in 
general not be contained within a region of a single die. 

This means that independent fault containment regions 
must be implemented in separate silicon dies. 


FCRs must be implemented in separate silicon dies (Ademaj 2003, p. 5)

But Ademaj didn’t just say it - he proved it via
experimentation on a communication chip specifically
designed for safety critical automotive drive-by-wire applications
(id., pg. 9 conclusions). Those results required the designers
of the TTP protocol chip (based on the work of Prof. Kopetz) to
change their approach to fault tolerance and adopt a star
topology, because combining a network CPU with the
network monitor on the same silicon die was proven to be
susceptible to single points of failure, even though the die had
been specifically designed to physically isolate the monitor
from the main CPU. Despite every attempt at on-chip isolation,
two circuits that were meant to be completely independent but
shared the same chip were observed to fail together from a
single fault in a safety-critical automotive drive-by-wire design.

In other words, if a safety critical system has a single point of 
failure - any single point of failure, even if mitigated with 
“failsafes” in the same fault containment region - then it is not 
only unsafe, but also flies in the face of well accepted practices for 
creating safe systems. 

In a future post I'll get into the issue of how critical a system has 
to be for addressing all single point failures to be mandatory. But 
the short version is that if someone is likely to die as a result of
system failure, you need to avoid all single point failures (every 
last one of them) on at least the level of FCRs. 
A previous posting described sources of random hardware faults.
This posting shows that even single bit errors can lead to 
catastrophic system failures. Furthermore, these types of faults 
are difficult or impossible to identify and reproduce in system- 
level testing. But, just because you can't make such errors 
reproduce at the snap of your fingers doesn't mean they aren't 
affecting real systems every day. 

Paraphrasing a favorite saying of my colleague Roy Maxion: "You 
may forget about unlikely faults, but they won't forget about 
you." 

Random Faults Cause Catastrophic Errors 

Random faults can and do occur in hardware on an infrequent 
but inevitable basis. If you have a large number of embedded 
systems deployed you can count on this happening to your 
system. But, how severe can the consequences be from a random 
bit flip? As it turns out, this can be a source of system-killer 
problems that have hit more than one company, and can cause 
catastrophic failures of safety critical systems. 

While random faults are usually transient events, even the most 
fleeting of faults will at least sometimes result in a persistent 
error that can be unsafe. “Hardware component faults may be 
permanent, transient or intermittent, but design faults will 
always be permanent. However, it should be remembered that 
faults of any of these classes may result in errors that persist 
within the system.” (Storey 1996, pg. 116) 

Not every random fault will cause a catastrophic system failure. 
But based on experimental evidence and published studies, it is 
virtually certain that at least some random faults will cause 
catastrophic system failures in a system that is not fully protected 
against all possible single-point faults and likely combinations of 
multi-point faults. 

Addy presented a case study of an embedded real time control 
system. The system architecture made heavy use of global 
variables (which is a bad idea for other reasons (Koopman 
2010)). Analysis of test results showed that a single memory bit 
written to a global variable caused an unsafe action for the 
system studied despite redundant checks in software. (Addy 
1991, pg. 77). This means that a single bit flip causing an unsafe 
system has to be considered plausible, because it has actually 
been demonstrated in a safety critical control system. 
4.1 Single Bit Overwrite [Can] Cause Safety Error

Fault analysis of the program revealed that a single bit 
indicated the operator had taken the last action to initiate a 
particular safety-critical function. This data bit was within the 
scope of all processes. Although reversal of the safety-critical 
function was still possible after the operator's action, the 
function would proceed unless positive action was taken to stop 
it. 

4.3 Identified Inadvertent Memory Overwrites 

Problems were identified where memory locations other than 
the intended location were set. None of these overwrites were 
to the last-action bit; indeed, only one of the overwrites was 
to a memory location that had an impact on safety. However, 
these problems demonstrated that memory overwrites actually 
occurred in the control system program. 


(Addy 1991, pp. 77, 78; analysis of
industrial real-time control system
bug reports)

Addy also reported a fault in which one process set a value, and 
then another process took action upon that value. Addy identified 
a failure mode in the first process of his system due to task death 
that rendered the operator unable to control the safety-critical 
process: 

The design of the program placed the software that interfaced 
with the operator display in one process, and the software that 
performed some of the safety-critical functions in the other 
processes. If the former process were halted due to an 
improper or unidentified trap, the operator would be unable to 
stop the safety-critical function using software control. In fact, 
since the operator display would be frozen, the operator might 
believe the safety-critical functions were no longer being 
performed. (There were alternate mechanical controls by 
which the safety-critical functions could be stopped, if the 
operator was aware of the situation.) 

Excerpt from (Addy 1991, pg. 80) 

In Addy’s case a display is frozen and control over the safety 
critical process is lost. As a result of the experience, Addy 
recommends a separate safety process. (Addy 1991 pg. 81). 

Sun Microsystems learned the Single Event Upset (SEU) lesson 
the hard way in 2000, when it had to recall high-end servers 
because they had failed to use error detection/correction codes 
and were suffering from server crashes due to random faults in 
SRAM. (Forbes, 2000) Fortunately no deaths were involved since 
this is a server rather than a safety critical embedded system. 
But Intel's Pentium problem involved a flaw that was mostly theoretical; it
was hard to find a customer who could show that the error affected any 
computation. The Sun glitch, by contrast, has caused crashes at dozens of 
customer sites. 

The problem involves "cache" memory chips, which store the most 
frequently needed code for instant access. In May, after months of 
struggling to identify the cause, Sun found it had been shipping servers 
whose cache modules contained faulty S-RAM (static random access 
memory) chips from a supplier it won’t name.

The faulty chips are easily disrupted by stray radiation, alpha particles or 
cosmic rays. The trouble occurs at the bit level—a one turns into a zero, or 
vice versa. When the computer detects an error in memory, it shuts down 
and reboots itself. High altitude, high temperatures and other factors can 
contribute to the problem. 


"You can run tests for a long time, and the problem doesn't happen. Then 
you put the machine back into its environment, and you get the problem. It 
took us months just to figure out what was going on," says Shoemaker.


Engineers have long known that memory chips can be disrupted by 
radiation and other environmental factors. That is why Hewlett-Packard and 
IBM use error-correcting code (ECC), which detects cache errors and 
restores bits that were changed by mistake. 

Sun servers lack ECC protection. "Frankly, we just missed it. It's something 
we regret at this point," Shoemaker says. Its next high-end servers, based
on a new processor called the UltraSparc III, will have it; they are to come
out in mid-2001.

Excerpt from Forbes 2000. 
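
For readers who have not seen how an error-correcting code restores a flipped bit, here is a minimal sketch using the textbook Hamming(7,4) code in Python. This is a stand-in chosen for brevity: it is an assumption of this example, not a description of the code used in any of the servers mentioned above, and real memory ECC typically protects much wider words.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    # Codeword positions 1..7 hold: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct up to one flipped bit; return (data_bits, error_position or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # parity over positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # parity over positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]  # parity over positions 4, 5, 6, 7
    position = s1 + 2 * s2 + 4 * s3  # 0 means no error detected
    if position:
        c[position - 1] ^= 1         # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], position


word = hamming74_encode([1, 0, 1, 1])
upset = list(word)
upset[4] ^= 1                        # simulate a single event upset at position 5
print(hamming74_decode(upset))       # ([1, 0, 1, 1], 5)
```

A single flipped bit is located and repaired. Two flipped bits in the same codeword would be miscorrected by this small code, which is one reason production ECC schemes add further check bits.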

Cisco also had a significant SEU problem with their 12000 series 
router line cards. They went so far as to publish a description and 
workaround notification, stating: “Cisco 12000 line cards may 
reset after single event upset (SEU) failures. This field notice 
highlights some of those failures, why they occur, and what work 
arounds are available.” (Cisco 2003) Again, these were not safety 
critical systems. 

More recently, Skarin & Karlsson presented a study of fault 
injection results on an automotive brake-by-wire system to 
determine if random faults can cause an unsafe situation. They 
found that 30% of random errors caused erroneous outputs, and 
of those 15% resulted in critical braking failures. (Skarin & 
Karlsson 2008) In the context of this brake-by-wire study, 
critical failures were either loss of braking or a locked wheel 
during braking (id., p. 148). Thus, it is clear that these sorts of
errors can lead to critical operational faults in electronic
automotive controls. It is worth noting that random errors tend
to increase dramatically with altitude, making this a relevant
environmental condition (Heijman 2011, p. 6).
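
Taking the brake-by-wire figures quoted above at face value (30% of injected errors producing erroneous outputs, and 15% of those being critical), the overall share of injected errors that ended in a critical braking failure can be worked out as below. The variable names are illustrative, and the calculation assumes the 15% is a fraction of the erroneous-output cases, as the wording here suggests.

```python
erroneous_output_fraction = 0.30  # share of injected errors giving erroneous outputs
critical_given_erroneous = 0.15   # share of those that were critical failures

critical_fraction = erroneous_output_fraction * critical_given_erroneous
print(f"{critical_fraction:.1%} of injected errors led to a critical braking failure")
```

Under that reading, a few percent of random errors is a small fraction, but it is far from negligible for a system exposed to random faults across a large fleet.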

Mauser reports on a Siemens Automotive study of electronic 
throttle control for automobiles (Mauser 1999). The study 
specifically accounted for random faults (id. p. 732), as well as 
considering the probability of “runaway” incidents (id., p. 734).
It found a possibility of single point failures, and in particular 
identified dual redundant throttle electrical signals being read by 
a single shared (multiplexed) analog to digital converter in the 
CPU (id., p. 739) as a critical flaw. 
For automotive safety applications, MISRA Software Guidelines
discuss the need to identify and correct corruption of memory 
values (MISRA Software Guidelines p. 30). 

Random Faults Can Be Impossible to Reproduce In 
General 


Intermittent faults from SEUs and other sources can and do 
occur in the field, causing real problems that are not readily 
reproducible even if the cause is a hard failure. In many cases 
components with faults are diagnosed as “Trouble Not 
Identified” (TNI). In my work with embedded system companies 
I have learned that it is common to have TNI rates of 50% for
complex electronic components across many embedded 
industries. (This means that half of the components returned to 
the factory as defective are found to be working properly when 
normal testing procedures are used.) While it is tempting to 
blame the customer for incorrect operation of equipment, the 
reality is that many times a defect in software design, hardware 
design, or manufacturing is ultimately responsible for these 
elusive problems. Thomas et al. give a case study of a TNI 
problem on Ford vehicles and conclude that it is crucial to 
perform a safety critical electronic root cause analysis for TNI 
problems rather than just assume it is the customer’s fault for not 
using the equipment properly or falsely reporting an error. 
(Thomas 2002) 

If no defect can be
determined and there is no other verifiable and auditable
explanation, then the module must still be considered a
field failure, due to the elusive and intermittent nature of […]

A manufacturer should assume that all field returns
are field failures, unless some alternative reason can be
verified. In fact, any company that produces a safety or
emissions regulated product should assume that every
complaint or return of that product is a failure, and take
[…]

It must not be assumed that a returned module that
passes tests associated with an engineering specification
is good.

The TNIs associated with Ford's TFI involved several
manufacturing, electrical, and thermal mechanical
factors that contributed to the existence of the failures,
as opposed to the existence of a single Achilles’ heel.

TNI reports must be tracked down to root cause. (Thomas 2002,
pg. 650)

The point is not really the exact number, but rather that random 
failures are the sort of thing that can be expected to happen on a 
daily basis for a large deployed fleet of safety-critical systems, 
and therefore must be guarded against rigorously. Moreover, lack 
of reproducibility does not mean the failure didn’t happen. In 
fact, irreproducible failures and very-difficult-to-reproduce 
failures are quite common in modern electronic components and
systems. 

Random Faults Are Even Harder To Reproduce In System-Level Testing

While random faults happen all the time in large fleets, detecting 
and reproducing them is not so easy. For example: “Studies have 
shown that transient faults occur far more often than permanent 
ones and are also much harder to detect.” (Yu 2003 pg. 11) 

It is also well known that some software faults appear random,
even though they may be caused by a software defect that could
be fixed if identified. Yu explains that this is a common situation:

Software faults are always the consequence of an incorrect design, at the
specification or at the coding time. Every software engineer knows that a
software product is bug free only until the next bug is found. Many of these
faults are latent in the code and show up only during operations, especially
under the heavy or unusual workloads and timing contexts.

Since software faults are a result of a bad design, it might be supposed
that all software faults would be permanent. Interestingly, practice shows
that despite their permanent nature, their behaviors are transient; that is,
when a bad behavior of a system occurs, it cannot be observed again, even if
a great care is taken to repeat the situation in which it occurred. Such
behavior is commonly called a failure of the system. The subtleties of the
system state may mask the fault, as when the bug is triggered by very
particular timing relationships between several system components, or by
some other rare and irreproducible situations. Curiously, most computer
failures are blamed on either software faults or permanent hardware faults, to
the exclusion of the transient and intermittent hardware types. Yet many
studies show these types are much more frequent than permanent faults. The
problem is that they are much harder to be tracked down.

Software faults can be very difficult to track down (Yu 2003, p. 12)

Random faults cannot be assumed to be benign, and cannot be 
assumed to always result in a system crash or system reset that 
puts the system into a safe state. “Hardware component faults
may be permanent, transient or intermittent, but design faults 
will always be permanent. However, it should be remembered 
that faults of any of these classes may result in errors that persist 
within the system.” (Storey 1996, p. 116) “Many designers believe 
that computers fail safe, whereas NASA experience has shown 
that computers may exhibit hazardous failure modes.” (NASA 
2004, p. 21) 

Moreover, random faults are just one type of fault that is
expected on a regular, if infrequent, basis. “Random faults, like
end-of-life failures of electrical components, cannot be designed
away. It is possible to add redundancy so that such faults can be
easily detected, but no one has ever made a CPU that cannot
fail.” (Douglass 1999, pg. 105, emphasis added)

MISRA states that “system design should consider both random 
and systematic faults.” (MISRA Software Guidelines p. 6). It also 
states “Any fault whose initiating conditions occur frequently will 
usually have emerged during testing before production, so that 
the remaining obscure faults may appear randomly, even though
they are entirely systematic in origin.” (MISRA Report 2 p. 7). In 
other words, software faults can appear random, and are likely to 
be elusive and difficult to pin down in products - because the 
easy ones are the ones that get caught during testing. 

SEUs and other truly random faults are not reproducible in 
normal test conditions because they depend upon rare 
uncontrollable events (e.g., cosmic ray strikes). “The result is an 
error in one bit, which, however, cannot be duplicated because of 
its random nature.” (Tang 2003, pg. 111). 

While intermittent faults might manifest repeatedly over a long 
series of tests with a particular individual vehicle or other 
embedded system, doing so may require precise reproduction of 
environmental conditions such as vibration, temperature, 
humidity, voltage fluctuations, operating conditions, and other 
factors. 


By their very nature intermittent faults are often very difficult to
detect and remove, as the fault detection process must coincide with the existence
of the fault. Unfortunately, many permanent faults appear to be intermittent
because their effects are only apparent at certain times. For example, a software
synchronization fault is permanent, as its code is always present, but its execution
will only sometimes result in an error, depending on timing considerations. Such
faults, having the characteristics of intermittent faults, are similarly difficult to
locate.

Hardware component failures may be permanent, transient or intermittent,
but design faults will always be permanent. However, it should be
remembered that faults of any of these classes may result in errors that persist
within the system.

Random faults may be difficult to track down, but can result in
errors that persist within the system (Storey 1996, pp. 115-116) 

It is unreasonable to expect a fault caused by a random error to 
perform upon demand in a test setup run for a few hours, or even 
a few days. It can easily take weeks or months to reproduce 
random software defects, and some may never be reproduced 
simply by performing system level testing in any reasonable 
amount of time. If a fleet of hundreds of thousands of vehicles, 
or other safety critical embedded system, only sees random faults 
a few times a day, testing a single vehicle is unlikely to see that 
same random fault in a short test time - or possibly it may never 
see a particular random fault over that single test vehicle’s entire 
operating life. 
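
The fleet-versus-single-vehicle argument can be put into rough numbers. The sketch below is in Python and uses hypothetical figures (300,000 vehicles and 3 fleet-wide faults per day) chosen only to match the "hundreds of thousands" and "a few times a day" wording above. It also assumes faults are independent and spread evenly across the fleet, so that the per-vehicle count over a test period follows a Poisson distribution.

```python
import math


def p_at_least_one_fault(fleet_size, fleet_faults_per_day, test_days):
    """Chance that one test vehicle shows at least one fault during the test."""
    per_vehicle_rate = fleet_faults_per_day / fleet_size  # faults per vehicle per day
    return 1.0 - math.exp(-per_vehicle_rate * test_days)


# Hypothetical fleet figures, for illustration only.
FLEET_SIZE = 300_000
FLEET_FAULTS_PER_DAY = 3

mean_days_between_faults = FLEET_SIZE / FLEET_FAULTS_PER_DAY
p_week = p_at_least_one_fault(FLEET_SIZE, FLEET_FAULTS_PER_DAY, test_days=7)

print(f"Mean days between faults on one vehicle: {mean_days_between_faults:,.0f}")
print(f"Chance of seeing a fault in a 7-day test: {p_week:.4%}")
```

With these assumed numbers a single vehicle would average one such fault in about 100,000 days of operation, and a week of testing has roughly a 7 in 100,000 chance of catching one, even though the fleet as a whole sees several every day.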

Thus, it is to be expected that realistic faults - faults that happen 
to all types of computers on an everyday basis when large 
numbers of them are deployed - can’t necessarily be reproduced 
upon demand in a single vehicle simply by performing vehicle- 
level tests. 
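
To put rough numbers on this argument, the sketch below assumes that faults arrive as a Poisson process and are spread evenly across the fleet. The fleet size, fleet-wide fault rate, and test durations are illustrative assumptions rather than figures from any particular system.

```python
import math

def prob_fault_seen_in_test(fleet_size, fleet_faults_per_day, test_days):
    """Chance that one test vehicle sees at least one fault during the test.

    Assumes faults arrive as a Poisson process and that every vehicle in
    the fleet is equally likely to experience each fault.
    """
    per_vehicle_rate = fleet_faults_per_day / fleet_size  # faults per vehicle-day
    return 1.0 - math.exp(-per_vehicle_rate * test_days)

# Hypothetical numbers: 300,000 vehicles, 3 random faults per day fleet-wide.
p_week = prob_fault_seen_in_test(300_000, 3, 7)
p_15_years = prob_fault_seen_in_test(300_000, 3, 15 * 365)
print(f"one-week test on one vehicle: {p_week:.6f}")
print(f"15 years on one vehicle:      {p_15_years:.3f}")
```

With these assumed numbers, a one-week test has roughly a 7 in 100,000 chance of showing the fault, and even 15 years of continuous operation of a single vehicle has only about a 5% chance, although the fleet as a whole sees the fault several times every day.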

Software Faults Can Also Appear To Be Random and 
Irreproducible In System-Level Testing 

Software data corruption includes changes during operation 
made to RAM and other hardware resources (for example, 
registers controlling I/O peripherals that can be altered by 
software). Unlike hardware data corruption, software data 
corruption is caused by design defects in software rather than 
disruption of hardware operating state. The defects may be quite 
subtle and seemingly random, especially if they are caused by 
timing problems (also known as race conditions). Because 
software data corruption is caused by a CPU writing values to 
memory or registers under software control just as the hardware 
would expect to happen with correctly working software, 
hardware data corruption countermeasures such as error 
detecting codes do not mitigate this source of errors. 

It is well known that software defects can cause corruption of 
data or control flow information. Since such corruption is due to 
software defects, one would expect the frequency at which it 
happens to depend heavily on the quality of the software. 

Generally developers try to find and eliminate frequent sources 
of memory corruption, leaving the more elusive and infrequent 
problems in production code. Horst et al. used modeling and 
measurement to estimate that a corruption would occur once per 
month with a population of 10,000 processors (1993, abstract), 
although that software was written to non-safety-critical levels of 
software quality. Sullivan and Chillarege used defect reports to 
understand “overlay errors” (the IBM term for memory 
corruption), and found that such errors had much higher impact 
on the system than other types of defects (Sullivan 1991, p. 9). 

MISRA recommends using redundant data or a checksum 
(MISRA Software Guidelines p. 30; MISRA Report 1 p. 21). NASA 
recommends using multiple copies or a CRC to guard against 
variable corruption (NASA 2004, p. 93). IEC 61508-3 
recommends protecting data structures against corruption (p. 72). 
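
One way to read the "redundant data" and "multiple copies or a CRC" recommendations is sketched below. The class names are hypothetical, and the choice of a bit-inverted second copy and of CRC-32 are assumptions made for illustration; the cited guidelines do not prescribe a particular scheme.

```python
import zlib

class CorruptionDetected(Exception):
    """Raised when a protected value fails its integrity check."""

class MirroredU32:
    """Keeps a 32-bit value together with its bitwise inverse (hypothetical helper)."""
    MASK = 0xFFFFFFFF

    def __init__(self, value):
        self.write(value)

    def write(self, value):
        self.value = value & self.MASK
        self.inverse = ~value & self.MASK

    def read(self):
        # A stray write that hits only one of the two copies breaks this relationship.
        if self.value ^ self.inverse != self.MASK:
            raise CorruptionDetected("copy and inverse disagree")
        return self.value

class CrcProtectedBlock:
    """Keeps a block of bytes together with a CRC-32 computed when it was stored."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.crc = zlib.crc32(self.data)

    def read(self):
        if zlib.crc32(self.data) != self.crc:
            raise CorruptionDetected("CRC mismatch")
        return bytes(self.data)

throttle = MirroredU32(1234)
throttle.value = 9999            # simulate a wild write that bypasses write()
try:
    throttle.read()
    detected = False
except CorruptionDetected:
    detected = True
print("corruption detected:", detected)
```

A check like this only catches writes that bypass the normal update path. If defective code calls `write()` with a wrong value, both copies are updated consistently and the corruption goes unnoticed.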

It is also well known that some software faults appear random, 
even though they may be caused by a software defect that could 
be fixed if identified. Yu explains that this is a common situation, as 
shown in a previous figure. (Yu 2003, p. 12) It is well known that 
software faults can appear to be irreproducible at the system 
level, making it virtually impossible to reproduce them upon 
demand with system level testing of an unmodified system. 

It is reasonable to expect any software - and especially 
software that has not been developed in rigorous accordance with 
high-integrity software procedures - to have residual bugs that 
can cause memory corruption, crashes or other malfunctions. 

The figure below provides an example software malfunction from 
an in-flight entertainment system. Such examples abound if you 
keep an eye out for them. 

Aircraft entertainment system software failure. February 13, 
2008 (photo by author) 


It should be no surprise if a system suffers random faults in a 
large-scale deployment that are difficult or impossible to 
reproduce in the lab via system-level testing, even if their 
manifestation in the field has been well documented. There 
simply isn't enough exposure to random fault sources in 
laboratory testing conditions to experience the variety of faults 
that will happen in the real world to a large fleet. 


Coming at this another way, saying that something can't happen 
because it can't be reproduced in the lab ignores how random 
hardware and software faults manifest in the real world. 
Arbitrarily bad random faults can and will happen eventually in 
any widely deployed embedded system. Designers have to make a 
choice between ignoring this reality, or facing it head-on to 
design systems that are safe despite such faults occurring. 

Monday, March 3, 2014 

Random Hardware Faults 


Computer hardware randomly generates incorrect answers once 
in a while, in some cases due to single event upsets from cosmic 
rays. Yes, really! Here's a summary of the problem and how often 
you can expect it to occur, with references for further reading. 


(As will be the case with some of my posts over the next few 
months, this text is written more in an academic style, and 
emphasizes automotive safety applications since that is the area I 
perform much of my research in. But I hope it is still useful, if 
less entertaining in style than some of my other posts.) 

Hardware corruption in the form of bit flips in memories has 
been known to occur on terrestrial computers for more than 30 
years (e.g., May & Woods 1979; Karnik et al. 2004; Heijman 
2011). Hardware data corruption includes changes during 
operation made to both RAM and other CPU hardware resources 
(for example bits held in CPU registers that are intermediate 
results in a computation). 

"Soft errors" are those in which a transient fault causes a loss of 
data, but not damage to a circuit. (Mariani03, pg. 51) 
Terminology can be slightly confusing in that a “soft error” has 
nothing to do with software, but instead refers to a hardware data 
corruption that does not involve permanent chip damage. These 
“soft” errors can be caused by, among other things, neutrons 
created by cosmic rays that penetrate the earth’s atmosphere. 

Soft errors can be caused by external events such as naturally 
occurring alpha particles emitted by impurities in the 
semiconductor materials, as well as neutrons from cosmic rays 
that penetrate the earth’s atmosphere (Mariani03, pp. 51-53). 
Additionally, random faults may be caused by internal events 
such as electrical noise and power supply disruptions that only 
occasionally cause problems (Mariani03, pg. 51). 

As computers evolve to become more capable, they also use 
smaller and smaller individual devices to perform computation. 
And, as devices get smaller, they become more vulnerable to data 
corruption from a variety of sources in which a subtle fleeting 
event causes a disruption in a data value. For example, a cosmic 
ray might create a particle that strikes a bit in memory, flipping 
that bit from a “1” to a “0,” resulting in the change of a data value 
or death of a task in a real time control system. (Wang 2008 is a 
brief tutorial.) 
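
As a small illustration of how much damage one flipped bit can do to a stored number, the sketch below inverts a single bit of an integer. The function name and the stored value are hypothetical.

```python
def flip_bit(value, bit, width=32):
    """Return value with one bit inverted, modelling a single event upset."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    return (value ^ (1 << bit)) & ((1 << width) - 1)

commanded = 0b0000_1010             # hypothetical stored value: 10
low_flip = flip_bit(commanded, 3)   # a "1" becomes a "0"
high_flip = flip_bit(commanded, 7)  # a "0" becomes a "1"
print(low_flip, high_flip)
```

Here the stored value 10 becomes 2 in one case and 138 in the other. Which bit is hit is a matter of chance, so the size of the resulting error is unpredictable.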

Chapter 1 of Ziegler & Puchner (2004) gives an historical 
perspective of the topic. Chapter 9 of Ziegler & Puchner (2004) 
lists accepted techniques for protecting against hardware 
induced errors, including use of error correcting codes (id., p. 9- 
7) and memory scrubbing (id., p. 9-8). It is noted that techniques 
that protect memory “are not effective for computational or logic 
operations” (id., p. 9-8), meaning that only values stored in RAM 
can be protected by these methods, although special exotic 
methods can be used to protect the rest of the CPU (id., p. 1-13). 
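
The two memory protection techniques named above can be sketched with a toy code. The example below uses a Hamming(7,4) code, chosen here only because it is small enough to read; it is an assumption for illustration, and real ECC memories generally use wider codes. The `scrub` function is likewise a hypothetical stand-in for a hardware or software scrubber.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4      # parity over positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4      # parity over positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4      # parity over positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(codeword):
    """Return (data_bits, error_position); position 0 means no error was seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4
    if position:
        c[position - 1] ^= 1          # correct the single flipped bit
    return [c[2], c[4], c[5], c[6]], position

def scrub(memory):
    """Rewrite every word that needed correction; return how many were fixed."""
    corrected = 0
    for i, word in enumerate(memory):
        data, position = hamming74_decode(word)
        if position:
            memory[i] = hamming74_encode(data)
            corrected += 1
    return corrected

memory = [hamming74_encode([1, 0, 1, 1]), hamming74_encode([0, 1, 1, 0])]
memory[0][4] ^= 1                     # simulate a single event upset in a stored word
fixed = scrub(memory)
print(fixed, hamming74_decode(memory[0]))
```

Scrubbing repairs single-bit errors before a second upset in the same word can make them uncorrectable. Note that the code only protects the word after it has been stored: a value that was computed incorrectly in the CPU is encoded and stored faithfully, which is the limitation the quoted passage points out.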

Causes for such failures vary in source and severity depending 
upon a variety of conditions and the exact type of technology 
used, but they are always present. The general idea of the failure 
mode is as follows. SRAM memory uses pairs of inverter logic 
gates arranged in a feedback pattern to “remember” a “1” or a “0” 
value. A neutron produced by a cosmic ray can strike atoms in an 
SRAM cell, inducing a current spike that over-rides the 
remembered value, flipping a 0 to a 1, or a 1 to a 0. Newer 
technology uses smaller hardware cells to achieve increased 
density. But having smaller components makes it easier for a 
cosmic ray or other disturbance to disrupt the cell’s value. While 
a cosmic ray strike may seem like an exotic way for computers to 
fail, this is a well documented effect that has been known for 
decades (e.g., May & Woods 1979 discussed radiation-induced 
errors) and occurs on a regular basis with a predictable average 
frequency, even though it is fundamentally a randomly occurring 
event. 


A Single Event Effect (SEE) is when a highly energetic particle 
(neutron), present in the environment, strikes sensitive regions of 
an electronic device disrupting its correct operation 



Radiation strike causing transistor disruption (Gorini 2012). 


A paper from Dr. Cristian Constantinescu, a top Intel 
dependability expert at that time, summarized the situation 
about a decade ago (Constantinescu 2003). He differentiates 
random faults into two cases. Intermittent faults, which might be 
due to a subtle manufacturing or wearout defect, can cause 
bursts of memory errors (id., pp. 15-16). Transient faults, on the 
other hand, can be caused by naturally occurring neutron and 
alpha particles, electromagnetic interference, and electrostatic 
discharge (id., pg. 16), which he refers to as “soft errors” in 
general. To simplify terminology, we’ll consider both transient 
and intermittent faults as “random” errors, even though 
intermittent errors strictly speaking may come in occasional 
bursts rather than having purely random independent statistical 
distributions. His paper goes on to discuss errors observed in real 
computers. In his study, 193 corporate server computers were 
studied, and 69% of them observed one or more random errors 
in a 16-month period. 2.6 percent of the servers reported more 
than 1000 random errors in that 16-month period due to a 
combination of random error sources (id., pg. 16): 


Fault and error data provided by a reliability study is useful in 
exemplifying the specifics of intermittent faults. Specially designed 
software collected error logs from 193 information technology 
production servers. Figure 2 shows the number of memory single-bit 
corrected errors reported over a period of 16 months. In this sample, 
60 machines (31 percent of the population) experienced no errors 
at all, 81 servers (42 percent) reported between one and five errors, 
and so on. Although most of the systems experienced a few errors, 
there were five servers (2.6 percent) that reported over 1,000 errors. 


[Figure 2. Histogram of the number of ... errors reported by 193 systems ...] 


Random Error Data (Constantinescu 2003, pp. 15-16). 

This data supports the well known notion that random errors can 
and do occur in all computer systems. It is to be expected due to 
minute variations in manufacturing, environments, and 
statistical distribution properties that some systems will have a 
very large number of such errors in a short period of time, while 
many of the same exact type of system will exhibit none even for 
extended periods of operation. The problem is only expected to 
get worse for newer generations of electronics (id., pg. 17). 


Nowadays, “soft errors” i.e. an undesired behavior of a memory 1 
sequential element in a digital component due to transient fai 
officially considered one of the main causes of loss of reliability in 
digital components, as clearly stated in The International Tec 
Roadmap for Semiconductors 2001: “Below 100 nm, single-even 
(soft errors) severely impact field-level product reliability, not c 
memory, but for logic as well.” and “The trend to greater integr 
dynamic, asynchronous and AMS/RF circuits increases vulnerab 
noise, crosstalk and soft error”. And confirmed again also in the 2001 
of the Roadmap: “Since the operating voltage decreases 20% per tec 
node, increasing noise sensitivity is becoming a big issue in the di 
functional devices (e.g., bits, transistors, gates) and products ( 
DRAMs or MPUs). This is becoming more evident due to low 
headroom especially in low-power devices, coupled interconnects, 
and ground bounce in the supply voltage, thermal impact on 
off-currents and interconnect resistivities, mutual inductance, s 
coupling, single-event upset (alpha particle), and increased use of 
logic families. Consequently, modeling, analysis and estimation 1 
performed at all design levels.”. 

Therefore, manufacturers are starting to plan real action to redu 
potential effects also for terrestrial applications. Automotive is one 
application fields where soft errors are becoming more and more irr 
Today’s vehicles host many microelectronics systems to improve eft 
safety, performance and comfort, as well as information and enterte 
These systems are in general Electronic Computing Unit (ECU) bas 
16 and 32 bit CPUs with a considerable amount of memories, and 
interconnected using robust networks. These networks remove the 1 
the thousands of costly and unreliable wires and connectors used to 1 
a wiring loom. It is the “x-by-wire” revolution, that it trans 
automotive components, once the domain of mechanic or hydrau 
truly distributed electronic systems [BREZ_01] [LEEN_02] [BERG_ 
course such systems must be highly reliable, as failure will cau 
accidents, and therefore soft errors must be taken into account. 

Excerpt from Mariani 2003, pg. 50 

Mariani (2003, pg. 50) makes it clear that soft errors “are 
officially considered one of the main causes of loss of reliability in 
modern digital components.” He even goes so far as to say of “x-by- 
wire” automotive systems (which include “throttle-by-wire” and 
other computer-controlled functions): “Of course such systems 
must be highly reliable, as failure will cause fatal accidents, and 
therefore soft errors must be taken into account.” Additionally, 
random faults may be caused by internal events such as electrical 
noise and power supply disruptions that only occasionally cause 
problems. (Mariani 2003, pg. 51). MISRA Report 2 makes it clear 
that “random hardware failures” must be controlled for safety 
critical automotive applications, with a required rigor to control 
those failures increasing as safety becomes more important. 
(MISRA Report 2 p. 18). 

Semiconductor error rates are often given in terms of FIT 
(“Failures in Time”), where one FIT equals one failure per billion 
operating hours. Mariani gives representative numbers as 4000 
FIT for a CPU circa 2001 (Mariani03, pg. 54), with error rates 
increasing significantly for subsequent years. Based on a review 
of various data sources, it seems reasonable to use an 
error rate of 1000 FIT/Mbit (1 error per 1e12 bit-hr) for SRAM 
(e.g., Heijman 2011 pg. 2, Wang 2008 pg. 430, Tezzaron 2004). 
Note that not all failures occur in memory. Any transistor on a 
chip can fail due to radiation, and not all storage is in RAM. As 
an example, registers such as a computer’s program counter are 
exposed to Single Event Upsets (SEUs), and the effect of an SEU 
there can result in the computer having arbitrarily bad behavior 
unless some form of mitigation is in place. Automotive-specific 
publications state that hardware induced data corruption must 
be considered (Seeger 1982). 
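
The arithmetic behind FIT rates is easy to get wrong by a few orders of magnitude, so here is a minimal sketch. The 1000 FIT/Mbit figure is the one used above; the SRAM size, fleet size, and daily operating hours are hypothetical, and 1 Mbit is taken as 1e6 bits.

```python
HOURS_PER_FIT_UNIT = 1e9   # 1 FIT = 1 failure per 1e9 device-hours

def upsets_per_hour(fit_per_mbit, sram_mbit):
    """Expected SRAM upsets per operating hour for one device."""
    return fit_per_mbit * sram_mbit / HOURS_PER_FIT_UNIT

def fleet_upsets_per_day(fit_per_mbit, sram_mbit, fleet_size, hours_per_day):
    """Expected SRAM upsets per calendar day across a whole fleet."""
    return upsets_per_hour(fit_per_mbit, sram_mbit) * fleet_size * hours_per_day

# Per-bit view of 1000 FIT/Mbit: 1000 / 1e9 hours / 1e6 bits = 1e-12 per bit-hour.
per_bit_hour = upsets_per_hour(1000, 1.0) / 1e6

# Hypothetical device: 0.25 Mbit of SRAM, fleet of 1,000,000 units, 1 hour of use per day.
per_hour = upsets_per_hour(1000, 0.25)
per_day = fleet_upsets_per_day(1000, 0.25, 1_000_000, 1.0)
print(per_bit_hour, per_hour, per_day)
```

Under these assumptions a single unit expects one SRAM upset per four million operating hours, while the fleet as a whole expects one about every four days. Not every upset leads to a visible malfunction, so this is an upper bound on SRAM-related events, and it leaves out upsets in registers and logic.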

SEU rates vary considerably based on a number of factors such as 
altitude, with data presented for Boulder Colorado showing a 
factor of 3.5 to 4.8 increase in cosmic-ray induced errors over 
data collected in Essex Junction Vermont (near sea level) 
(O’Gorman 1994, abstract). 

Given the above, it is clear that random hardware errors are to be 
expected when deploying a large fleet of embedded systems. This 
means that if your system is safety critical, you need to take such 
errors into account. In a future post I'll explore how bad the 
effects of such a fault can be. 

The Therac 25: A Case Study in Unsafe Software 


Many of you are writing software that has safety aspects creeping 
into it. There's nothing like a real-world case study to bring home 
the consequences of unsafe software. The Therac 25 story is 
must-read material. The short version is that several radiation 
therapy patients were killed by massive radiation overdoses that 
trace back to bad software. See below for more details... 


(This article is written in an academic style rather than as an 
informal blog post. But hopefully it's informative.) 

The Therac 25 accidents form the basis for what is often 
considered the best-documented software safety case-study 
available. The experience illustrates a number of principles that 
are vital to understanding how and why the design and analysis 
of safety-critical systems must be done in a methodical way 
according to established principles. The Therac 25 accidents 
came at a time before many current practices were widespread, 
and serve as a cautionary tale for why such practices exist and are 
essential to creating safe systems. 

Briefly, the Therac 25 was a medical radiation therapy machine 
that was supposed to deliver controlled doses of radiation to 
cancer patients. Basically, this was a “radiation-by-wire” system 
in which software was used to replace some hardware safety 
mechanisms. Due to software defects, among other factors, it was 
involved in six known massive overdose accidents resulting in 
deaths and serious injuries (Leveson 1993, p. 18). A simple 
explanation of the likely mechanism for the accidents was using a 
beam strength intended for X-ray exposure, but without the metallic 
beam-attenuating target that converts the electron beam to X-rays in 
place, resulting in 100x overdoses. Due to limitations of the dose 
measurement system, the way patients knew they were overexposed 
was radiation burns (and, in at least one case, a reported 
sizzling sound of the radiation dose measurement devices frying). 



(Source: ComputingCases.Org) 


Some of the characteristics of the Therac 25 development process 
are summarized as: almost all testing was done at the system 
level rather than as lower level unit tests, shared memory 
variables were unprotected from concurrency defects, and “race 
conditions due to multitasking without protecting shared 
variables played an important part in the accidents” (id., text box 
pp. 20-21). Operators were taught that there were “so many safety 
mechanisms” that it was “virtually impossible to overdose a 
patient.” (id., p. 24) 
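
To see why an unprotected shared variable produces timing-dependent behavior, consider the sketch below. It is a generic illustration of a lost update, not a reconstruction of the Therac 25 code; the task and scheduling model are assumptions. Each `yield` marks a point where a task switch could occur, and the schedule string decides which task runs next.

```python
def increment_task(shared):
    """Read-modify-write on a shared variable with no protection."""
    local = shared["count"]        # read the shared value
    yield                          # a task switch can happen here
    shared["count"] = local + 1    # write back, possibly overwriting another update

def run(schedule):
    """Run tasks A and B one step at a time in the order given by `schedule`."""
    shared = {"count": 0}
    tasks = {"A": increment_task(shared), "B": increment_task(shared)}
    for name in schedule:
        next(tasks[name], None)    # advance one step; ignore tasks that have finished
    return shared["count"]

sequential = run("AABB")   # A finishes before B starts
interleaved = run("ABAB")  # B reads the variable before A writes it back
print(sequential, interleaved)
```

The same code yields 2 with the first schedule and 1 with the second. A test campaign that only ever exercises the first ordering will report the software as working, which is one reason system-level testing alone tends to miss this class of defect.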

The manufacturer could not reproduce an initially reported 
problem involving the Ontario Cancer Foundation mishap in 
1985. After analysis, they blamed a patient turntable position 
measurement sensor. A sensor modification and a software 
failsafe were added to mitigate the problem (id., pp. 23-26), and 
the manufacturer claimed a five order of magnitude safety 
improvement. But this was not an accurate assessment. 

Later, after the 1986 East Texas Cancer Center accidents, two 
manufacturer engineers could not reproduce a malfunction 
indication reported by the local staff. The manufacturer’s “home 
office engineer reportedly explained that it was not possible for 
the Therac-25 to overdose a patient.” But this was found to be 
untrue after an investigation into a second overdose a month 
later at the same facility revealed the problem to be a software 
defect (id., pp. 27-28). Reproducing the effects of the software 
defect was difficult, because it was timing-dependent and 
involved the speed of radiation prescription data entry (id., p. 28). 
A number of hardware and software mitigations were added 
(id., pp. 31-32). But even then, an entirely different timing- 
dependent software problem emerged to cause the Yakima Valley 
1987 overdose mishap (id., pp. 33-34), and potentially another 
mishap. 

At a technical level, some of the factors that contributed to the 
Therac 25 accidents included: cryptic error messages, using a 
home-brew real time operating system, mutex operations that 
were not atomic (and therefore, defective), race conditions 
between user inputs and machine actions, a problem that only 
manifested when a counter value rolled over to zero, and 
generally inadequate testing and reviews. 
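
A counter rollover defect of the kind listed above can be sketched as follows. The 8-bit width and the use of the counter as a "check needed" flag are assumptions for illustration, not a reconstruction of the actual Therac 25 code.

```python
def needs_check_after(passes):
    """Model an 8-bit variable that is incremented on every pass and
    then treated as a flag: nonzero means 'perform the safety check'."""
    flag = 0
    for _ in range(passes):
        flag = (flag + 1) & 0xFF   # incremented rather than set to a fixed nonzero value
    return flag != 0

# Find the pass counts at which the check would be silently skipped.
skipped = [n for n in range(1, 1025) if not needs_check_after(n)]
print(skipped)
```

In this model the check is skipped only on passes 256, 512, 768, and 1024. A defect that appears once in every 256 passes, and only matters if something else is wrong at that moment, is very unlikely to be found by system-level testing.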

When used for treatment, the machines were known to throw lots 
of error codes. But instead of this being seen as a sign that the 
machines were exercising the safety mechanisms often (which is 
a really bad idea), this was interpreted as being safe due to all the 
shutdowns. In reality, a system that exercises its failsafes all the 
time is prone to eventually seeing a fault that gets past the 
failsafes. It's well known in operating safety-critical systems that 
exercising failsafes is undesirable. They are your last line of 
defense, and should be a last resort backup that is almost never 
activated. Regularly exercising failsafes is a hallmark of an unsafe 
system. 
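
The reasoning can be made concrete with a simple probability sketch. It assumes each fault is independent and is caught by the failsafe with a fixed probability; the 99.9% coverage figure is hypothetical.

```python
def prob_fault_escapes(coverage, activations):
    """Chance that at least one of `activations` faults slips past a failsafe
    that catches each fault independently with probability `coverage`."""
    return 1.0 - coverage ** activations

rare = prob_fault_escapes(0.999, 10)          # failsafe exercised 10 times
routine = prob_fault_escapes(0.999, 10_000)   # failsafe exercised 10,000 times
print(f"{rare:.4f} {routine:.4f}")
```

With these assumed numbers, a failsafe that is exercised 10 times lets a fault through with a probability of about 1%, while one exercised 10,000 times lets at least one fault through with a probability above 99.99%.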

The lessons from the Therac 25 form bedrock principles for the 
safety critical software community. They include: “Accidents are 
seldom simple - they usually involve a complex web of 
interacting events with multiple contributing technical, human, 
and organizational factors.” (id., p. 38). Do not assume that 
fixing a particular error will prevent future accidents (“There is 
always another software bug”) (id.). Higher level system 
engineering failures are often relevant, such as: lack of follow- 
through on all reported incidents, overconfidence in the software, 
less-than-acceptable software engineering practices, and 
unrealistic risk assessments (which for the Therac 25 included an 
assessment that the software was defect-free) (id.). 

“Designing any dangerous system in such a way that one failure 
can lead to an accident violates basic system-engineering 
principles. In this respect, software needs to be treated as a single 
component.” (id. pp. 38-39). (In context, this refers to the 
software resident on a single CPU, meaning that if any software 
defect on one CPU can cause an accident, that is a single point 
failure that renders the system unsafe.) 

Leveson lists “basic software-engineering practices that 
apparently were violated with the Therac-25” as: documentation 
should not be an afterthought; software quality assurance 
practices and standards should be established; designs should be 
kept simple; ways to get error information should be designed in; 
and that “the software should be subjected to extensive testing 
and formal analysis at the module and software level: system 
testing alone is not adequate.” (id., p. 39) Leveson finishes by 
saying that although this was a medical system, “the lessons 
apply to all types of systems where computers control dangerous 
devices.” (id., p. 41) 

Reference: 

Leveson, An investigation of the Therac-25 Accidents, IEEE 
Computer, July 1993, pp. 18-41. (updated version 
here: http://sunnyday.mit.edu/papers/therac.pdf) 

Other reading: 

Therac-25 Case materials for teaching: 
http://www.computingcases.org/case_materials/therac/therac_case_intro.html 

CMU 18-649 Software safety lecture (second half covers Therac 25) 

Better Embedded System Software. Chapter 28 is on software 
safety. 